
[SPARK-13019][Docs] Replace example code in mllib-statistics.md using include_example #11108

Closed
wants to merge 24 commits

Conversation

keypointt
Contributor

https://issues.apache.org/jira/browse/SPARK-13019

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}
Jekyll will find examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala and pick code blocks marked "example" and replace code block in
{% highlight %}
in the markdown.

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
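For concreteness, here is a minimal sketch of how such an example file is marked up. This is not one of the PR's actual files — the object name and the statistic computed are illustrative — but the `$example on$` / `$example off$` marker comments match the convention the PR uses, and only the region between them is pulled into the generated user guide:

```scala
// Illustrative sketch of a hypothetical example file under
// examples/src/main/scala/org/apache/spark/examples/mllib/.
// The include_example Jekyll tag extracts just the marked region.
object SummaryStatisticsSketch {
  def main(args: Array[String]): Unit = {
    // $example on$
    val observations = Seq(1.0, 2.0, 3.0)
    val mean = observations.sum / observations.size
    println(s"mean: $mean")
    // $example off$
  }
}

SummaryStatisticsSketch.main(Array.empty) // run when executed as a script
```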

@mengxr
Contributor

mengxr commented Feb 11, 2016

ok to test

@mengxr
Contributor

mengxr commented Feb 11, 2016

cc @yinxusen

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51145 has finished for PR 11108 at commit a4dd0fb.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 16, 2016

Test build #51352 has finished for PR 11108 at commit 0df3e65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 20, 2016

Test build #51602 has finished for PR 11108 at commit d817d0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2016

Test build #51646 has finished for PR 11108 at commit f945222.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2016

Test build #51800 has finished for PR 11108 at commit aec10ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


object SummaryStatisticsExample {

def main(args: Array[String]) {
Contributor

def main(args: Array[String]): Unit = {

# $example off$

if __name__ == "__main__":
    sc = SparkContext(appName="HypothesisTestingKolmogorovSmirnovTestExample")  # SparkContext
Contributor

remove comment

@mengxr
Contributor

mengxr commented Mar 16, 2016

@keypointt @yinxusen I made one pass and left some comments inline. One issue is whether we want to print empty lines; it doesn't seem necessary for example code appearing in the user guide. There are also some redundant comments, which we should remove. But overall, this looks great!

@yinxusen
Contributor

@mengxr If there are no empty lines, we get a solid block of output that is hard to read. I think it's a good experience for users to run ./bin/run-example ml.xxxExample and see some tidy output directly.

@mengxr
Contributor

mengxr commented Mar 17, 2016

I agree. But we don't really need lines just to print empty lines. For example:

println(goodnessOfFitTestResult)
println()

could be simply

println(s"$goodnessOfFitTestResult\n")

It is good to keep the example code short.
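As a quick, self-contained sanity check (the string value here is a stand-in; in the real example `goodnessOfFitTestResult` is a test-result object, and only its toString matters), the interpolated one-liner emits exactly the same characters as the two separate println calls:

```scala
// Stand-in for the test-result object; any value with a toString behaves the same.
val goodnessOfFitTestResult = "Chi squared test summary"

// println(x) followed by println() emits x, a newline, then a bare newline...
val twoCalls = goodnessOfFitTestResult + "\n" + "\n"
// ...which is exactly what the single interpolated call emits.
val oneCall = s"$goodnessOfFitTestResult\n" + "\n"

assert(twoCalls == oneCall)
```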

@yinxusen
Contributor

Sure, that makes sense.


@SparkQA

SparkQA commented Mar 17, 2016

Test build #53460 has finished for PR 11108 at commit a4eb28d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@keypointt
Contributor Author

Hi @mengxr, how do you like this change?

@mengxr
Contributor

mengxr commented Mar 21, 2016

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 1af8de2 Mar 21, 2016
@keypointt
Contributor Author

Thank you!

@yhuai
Contributor

yhuai commented Mar 21, 2016

Seems this one breaks scala 2.10 compilation (https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-sbt-scala-2.10/547/console). Can you fix it?

@keypointt
Contributor Author

Oh sure, I'll try to fix it ASAP.

@keypointt
Contributor Author

@yhuai I've just tested on my local machine, and there is no compile error and the example output results correctly.

Also, it seems all the parameters are right according to the method documentation for sampleByKey() and sampleByKeyExact(): http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

And the build before merging to master was also successful...I'm not quite sure why this master build failed...

@mengxr
Contributor

mengxr commented Mar 22, 2016

I reverted the change in master to unblock other builds. @keypointt Could you test it on Scala 2.10? It seems that Scala 2.10 doesn't resolve keyword arguments nicely:

def f(a: String, b: String, c: String = "xxx") = {
  println(a, b, c)
}

f(a = "abc", "def")
<console>:34: error: not enough arguments for method f: (a: String, b: String, c: String)Unit.
Unspecified value parameter b.
              f(a = "abc", "def")

We should either fix the seed or use keyword argument for fractions.
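A self-contained sketch of the workaround (the function name and types here are illustrative, mirroring only the shape of sampleByKey): Scala 2.10 rejects a named argument followed by a positional one, but naming every argument compiles on both 2.10 and 2.11:

```scala
// Illustrative stand-in for the shape of PairRDDFunctions.sampleByKey:
// a flag, a required fractions map, and a defaulted seed.
def sample(withReplacement: Boolean,
           fractions: Map[String, Double],
           seed: Long = 42L): String =
  s"replace=$withReplacement fractions=$fractions seed=$seed"

// Breaks on Scala 2.10 (named argument before a positional one):
//   sample(withReplacement = false, Map("a" -> 0.1))
// Portable fix: name every argument, as the follow-up PR does for fractions.
val out = sample(withReplacement = false, fractions = Map("a" -> 0.1))
println(out)
```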

@keypointt
Contributor Author

Oh yes, I just tried both 2.10 and 2.11: the 2.11 build succeeded but the 2.10 build failed.

@keypointt
Contributor Author

@mengxr I've added the keyword argument at my forked repo: keypointt@892fe60
And for demonstration purposes, should 'seed' be explicitly added? Or should I remove the val 'seed' in this commit?

Should I open a separate PR, since the new commit is not syncing to this PR?

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
… include_example

Author: Xin Ren <[email protected]>

Closes apache#11108 from keypointt/SPARK-13019.
@keypointt
Contributor Author

Hi @mengxr, I'm using the command below to compile this branch (after merging in the latest master branch) with scala-2.10:
build/sbt -Pyarn -Phadoop-2.3 -Pkinesis-asl -Pspark-ganglia-lgpl -Phive -Phive-thriftserver -Pscala-2.10 compile test:compile

and I'm always getting the error sbt.ResolveException: unresolved dependency: org.scala-lang#jline;2.11.7: not found, where the jline version is set at

spark/pom.xml, line 157 (at 48978ab):

<jline.version>${scala.version}</jline.version>

I'm assuming the master branch now defaults to scala-2.11, so when building for 2.10 there could be some dependency-resolution issues?

@mengxr
Contributor

mengxr commented Mar 22, 2016

asfgit pushed a commit that referenced this pull request Mar 24, 2016
… mllib-statistics.md using include_example

## What changes were proposed in this pull request?

This PR for ticket SPARK-13019 is based on the previous PR (#11108).
Since PR #11108 broke the scala-2.10 build, more work was needed to fix the build errors.

What is new in this PR is adding keyword arguments for 'fractions':
`    val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)`
`    val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)`

I reopened the ticket on JIRA, but sorry, I don't know how to reopen a GitHub pull request, so I'm just submitting a new one.

## How was this patch tested?

Manual build testing on local machine, build based on scala-2.10.

Author: Xin Ren <[email protected]>

Closes #11901 from keypointt/SPARK-13019.